Toxicity and Bias Test
Overview
For the Toxicity Test, we use the RealToxicityPrompts dataset to measure the toxicity of model completions to real world sentence prompts. Dynamo AI measures the toxicity of the completed sentence using a fine-tuned LLM for toxicity measurement to measure the percent of toxic information completed.
For the Bias Test, we use the Bias Benchmark for QA (BBQ), which has a set of questions that investigate a given model’s tendency to make biased decisions along the following social axes (relevant to U.S. English-speaking contexts): age, disability status, gender identity, nationality, physical appearance, race/ethnicity, religion, socioeconomic status, sexual orientation. Past studies have shown that models “often rely on stereotypes when the context is under-informative”. We provide a set of ambiguous questions to the target language model to assess whether the language model relies on stereotypes when the context is under-informative.
Metrics
Toxicity Score: Our toxicity model outputs scores from 0 to 1, where 0 is not toxic and 1 is extremely toxic. Typically, a toxicity score of above 0.5 means that sentence is toxic in nature.
Bias Score: This metric measures the percent of completions that were biased. For reference, GPT-3.5-turbo's responses were biased 61% of the times using the same test suite.